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Abstract. This EM review article focuses on parameter expansion, a 
simple technique introduced in the PX-EM algorithm to make EM con- 
verge faster while maintaining its simplicity and stability. The primary 
objective concerns the connection between parameter expansion and 
efficient inference. It reviews the statistical interpretation of the PX- 
EM algorithm, in terms of efficient inference via bias reduction, and 
further unfolds the PX-EM mystery by looking at PX-EM from differ- 
ent perspectives. In addition, it briefly discusses potential applications 
of parameter expansion to statistical inference and the broader impact 
of statistical thinking on understanding and developing other iterative 
optimization algorithms. 

Key words and phrases: EM algorithm, PX-EM algorithm, robit re- 
gression, nonidentifiability. 
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1. INTRODUCTION 

The expectation maximization (EM) algorithm of 
Dempster, Laird and Rubin (1977) has proven to be 
a popular computational method for optimization. 
While simple to implement and stable in its conver- 
gence, the EM algorithm can converge slowly. Many 
variants of the original EM algorithm have also been 
proposed in the last 30+ years in order to over- 
come shortcomings that are sometimes seen in im- 
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plementations of the original method. Among these 
EM-type algorithms are the expectation-conditional 
maximization (ECM) algorithm of Meng and Ru- 
bin (1993), the expectation-conditional maximiza- 
tion either (ECME) algorithm of Liu and Rubin 
(1994), the alternating ECM (AECM) algorithm of 
Meng and van Dyk (1997) and, more recently, the 
dynamic ECME (DECME) algorithm of He and Liu 
(2009). This review article focuses on parameter ex- 
pansion as a way of improving the performance of 
the EM algorithm through a discussion of the pa- 
rameter expansion EM (PX-EM) algorithm proposed 
by Liu, Rubin and Wu (1998). 

The EM algorithm is an iterative algorithm for 
maximum likelihood (ML) estimation from incom- 
plete data. Let X Q b s be the observed data and let 
/(XobsjO) denote the observed-data model with un- 
known parameter 6, where X ^ s G X ^ s and 9 G 0. 
Suppose that the observed-data model can be ob- 
tained from a complete-data model, denoted by 



ff^obs^mis; 0), where X b s G X ^ s , X n 
6> G 0. That is, 



G <r mis , and 



f(X bs]8) 



g(^obs, -^mis! 6) dX B 



Given a starting point 6^ G 0, the EM algorithm 
iterates for t = 0, 1, . . . between the expectation (E) 
step and maximization (M) step: 
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E step. Compute the expected complete-data log- 
likelihood 



[1.1) 



E(\ng(X ohB ,X mia ;0)\X ohs ,O = e®) 



as a function of fls6; and 
M step. Maximize Q(9\9 [ ~ e >) to obtain 



(1.2) 



£ (m) = argmaxQ(0|# (i) ). 
eee 



Two EM examples are given in Section 2. 

Roughly speaking, the E step can be viewed as cre- 
ating a complete-data problem by imputing missing 
values, and the M step can be understood as con- 
ducting a maximum likelihood-based analysis. More 
exactly, for complete-data models belonging to the 
exponential family, the E step imputes the complete- 
data sufficient statistics with their conditional ex- 
pectations given the observed data and the current 
estimate 9^ of the parameter 9. This to some ex- 
tent explains the simplicity of EM. The particular 
choice of (1.1) together with Jensen's inequality im- 
plies monotone convergence of EM. 

The PX-EM algorithm is essentially an EM al- 
gorithm, but it performs inference on a larger full 
model. This model is obtained by introducing extra 
parameters into the complete-data model while pre- 
serving the observed-data sampling model. Section 

3.1 presents the structure used in PX-EM. The the- 
oretical results established by Liu, Rubin and Wu 
(1998) show that PX-EM converges no slower than 
its parent EM. This is somewhat surprising, as it 
is commonly believed that optimization algorithms 
generally converge slower as the number of dimen- 
sions increases. To help understand the behavior of 
PX-EM, Liu, Rubin and Wu (1998) provided a sta- 
tistical interpretation of the PX-M step in terms of 
covariance adjustment. This is reviewed in Section 

3.2 in terms of bias reduction using the example of 
binary regression with a Student-t link (see Mud- 
holkar and George, 1978; Albert and Chib, 1993; 
Liu, 2004), which serves as a simple robust alterna- 
tive to logistic regression and is called robit regres- 
sion by Liu (2004). 

To help further understand why PX-EM can work 
so well, several relevant issues are discussed in Sec- 
tion 4. Section 4.1 provides additional motivation 
behind why PX-EM can improve upon EM or ECM. 
In Section 4.2 we argue that parameter expansion 
can also be used for efficient data augmentation in 



the E step. The resulting EM is effectively the PX- 
EM algorithm. 

In addition to the models discussed here, param- 
eter expansion has now been shown to have com- 
putational advantages in applications such as factor 
analysis (Liu, Rubin and Wu, 1998) and the analy- 
sis of both linear (Gelman et al., 2008) and nonlin- 
ear (Lavielle and Meza, 2007) hierarchical models. 
However, Gelman (2004) shows that parameter ex- 
pansion offers more than a computational method 
to accelerate EM. He points out that parameter ex- 
pansion can be viewed as part of a larger perspec- 
tive on iterative simulation (see Liu and Wu, 1999; 
Meng and van Dyk, 1999; van Dyk and Meng, 2001; 
Liu, 2003) and that it suggests a new family of prior 
distributions in a Bayesian framework discussed by 
Gelman (2006). One example is the folded noncen- 
tral Student-t distribution for between-group vari- 
ance parameters in hierarchical models. This method 
exploits a parameter expansion technique commonly 
used in hierarchical models, and Gelman (2006) shows 
that it can be more robust than the more common 
inverse-gamma prior. Inspired by Gelman (2004), we 
briefly discuss other potential applications of param- 
eter expansion to statistical inference in Section 5. 

2. TWO EM EXAMPLES 

2.1 The Running Example: A Simple 

Poisson-Binomial Mixed- Effects Model 

Consider the complete-data model for the observed 
data A^obs = X and the missing data -X" m i s = Z: 



and 



Z\X ~ Poisson(A) 



X\(Z,X) ~ Binomial (Z, tt), 



where tt E (0, 1) is known and A > is the unknown 
parameter to be estimated. 

The observed-data model f(X; A) is obtained from 
the joint sampling model of (X, Z): 



(2.1) 



g(X,Z;X) 
X z 



Z\ 



Z\ X\(Z-X)\ 



TT 1 — TT) 



where X = 0, 1, . . . , Z, Z = 0, 1, . . . , and A > 0. That 
is, f(X; A) is derived from g(X, Z;X) by integrating 
out the missing data Z as follows: 



f(X;X) 



CO . z 



7T X (1 -Tt) 



z=X 



z\ X\{z-X)\ 



z-X 
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\X„X °° \z-X 



-X 



XI 



(z-xy. 



z=X 



fc =£ _-X (Avr) A _ a ^ [A(l -7T 
k=0 



XI 



x\ 



e -A e A(i-"0 



(A^ -Aw 

Alternatively, one can get the result from the well- 
known fact related to the infinite divisibility of the 
Poisson distribution; namely, if X\ = X and X2 = 
Z — X are independent Poisson random variables 
with rate Ai = Xtt and A2 = A(l — it), then X\ + 
X2 ~ Poisson(Ai + A2) and conditional on X\ + X2, 
Xi ~ Binomial(A"i + X 2 , Ai/(Ai + A 2 )). 

It follows that the observed-data model is X|A ~ 
Poisson(-7rA). Thus, the ML estimate of A has a closed- 
form solution, A = X/tt. This artificial example serves 
two purposes. First, it is easy to illustrate the gen- 
eral EM derivation. Second, we use this example in 
Section 3.3 to show an extreme case in which PX- 
EM can converge dramatically faster than its par- 
ent EM; PX-EM converges in one-step, whereas EM 
converges painfully slowly. 

The complete-data likelihood is given by the joint 
sampling model of (X,Z) found in equation (2.1). 
It follows that the complete-data model belongs to 
the exponential family with sufficient statistic Z for 
A. The complete-data ML estimate of A is given by 



(2.2) 



A, 



Z. 



To derive the E step of EM, the conditional distribu- 
tion of the missing data Z given both the observed 
data and the current estimate of the parameter A is 
used. It is determined as follows: 



h(Z\X, A) 



g(X,Z;X) 



T l T=x9(X,z;X) 

[\{l-v)] z - x /(Z-X)\ 

Y.7=xim-*)Y- x /{z-x)V) 

(Z-X)\ 

Thus, Z|{X,A}~X + Poisson(A(l-7r)). This yields 
E(Z\X,X)=X + X(1-tt). 



Thus, the EM algorithm follows from the discussion 
of Dempster, Laird and Rubin (1977) on exponen- 
tial complete-data models. Specifically, given the up- 
dated estimate A^ at the tth iteration, EM follows 
these two steps: 

E step. Compute Z = E(Z\X, A = A®) = X + X® x 
(1-tt). 

M step. Replace Z in (2.2) with Z to obtain A( t+1 ) = 
Z. 

It is clear that the EM sequence {A^ : t = 0, 1, . . .} 
is given by 

(2.3) A (m) =X + X {t) (l-ir) (t = 0,l,...) 
converging to the ML estimate 
X = X/tt. 

Rewrite (2.3) as 

A(* +1 )-A=(1-7t)(aW-A) 

to produce a closed-form expression for the conver- 
gence rate of EM: 



|A(* +1 ) -A| 



l-vr. 



|A« -A| 

This indicates that EM can be very slow when tt ~ 0. 

2.2 ML Estimation of Robit Regression via EM 

Consider the observed data consisting of n ob- 
servations X b s = {(x{, yi) : i = 1, . . . , n} with a p- 
dimensional covariate vector X{ and binary response 
Ui that takes on values of and 1. The binary regres- 
sion model with Student- i link assumes that, given 
the covariates, the binary responses yi's are inde- 
pendent with the marginal probability distributions 
specified by 



(2.4) 



Pr( yi = l\ Xi ,p) = l- Pr( yi = 0\ Xi , /3) 

= F u (x' i (3) (i = l,...,n), 



where F u (-) denotes the c.d.f. of the Student-t dis- 
tribution with center zero, unit scale and v degrees 
of freedom. With v ~ 7, this model provides a ro- 
bust approximation to the popular logistic regres- 
sion model for binary data analysis. Here we con- 
sider the case with known v. 
The observed-data likelihood 

n 

i=i 
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involves the product of the c.d.f. of the Student-i 
distribution F u (-) evaluated at x^/3 for i = 1, . . . , n. 
The MLE of f$ does not appear to have a closed- 
form solution. Here we consider the EM algorithm 
for finding the MLE of f). 

A complete-data model for implementing EM to 
find the ML estimate of (3 is specified by introducing 
the missing data consisting of independent latent 
variables (r^, Zj) for each i = 1, . . . ,n with 

(2.5) ri|/3 ~Gamma(i//2,i//2) 
and 

(2.6) ^|(Ti,/3)~N(^/3,l/Ti). 
Let 

< 2 - 7 > «={J; if 2 5 o p-i.-.-)- 

Then the marginal distribution of yj is preserved and 
is given by (2.4). The complete-data model belongs 
to the exponential family and has the following suf- 
ficient statistics for /?: 

n n 

(2.8) S TXX i = ^ ^ TjXjXj and iS ra2 = ^ ^ TjXjZ^. 

i=l i=l 

The complete-data ML estimate of /3 is given by 

(2.9) Ax>m — S txx i S TXZ , 

leading to the following EM algorithm. 

Starting with say, j3^> = (0, . . . , 0), EM iter- 
ates for t = 0, 1, . . . with iteration £ + 1 consisting of 
the following E and M steps: 

E step. Compute S TXX > = E(S TXX >\/3 = /3 (4) , A" obs ) and 

S T xz = E(S TXZ \(3 = /3w,X obs ). 
M step. Update the estimate of (3 to obtain (3^ t+1 ^ = 

q-l q 

°txx" Jtxz - 

Let fu(-) denote the p.d.f. of F v (-). The E step can 
be coded by using the following results derived in 
Liu (2004): 

ti = E(Ti\!3 = pV,X oha ) 

(2.10) 

m -(2y l -l)F l/+2 (-(l + 2M l / 2 x' i ^) 
y i -{2y i -l)F v {-x' i ^)) 

(2.11) nzi = E(T iZi \(3 = /3 (t) ,X obs ) = nzu 

where 

Z{ = x^/3^ ^ 

(2y l -l)f u (x' l f3^) 

Vl - (2y t - l)F v+2 (-(l + 2/ l /)V2 x '. /3 (*)) 



for i = 1, . . . ,n. 

However, the EM algorithm can also converge slow- 
ly in this example. This is discussed in Section 3.2, 
where it is shown that PX-EM can greatly improve 
the convergence rate. 

3. THE PX-EM ALGORITHM 

3.1 The Algorithm 

Suppose that the EM complete-data model can 
be embedded in a larger model g*(X \, s , X m i s ; 6*, a) 
with the expanded parameter (9*, a) 66 x A. As- 
sume that the observed-data model is preserved in 
the sense that, for every (S„a)60x A, 

(3.1) /(X obs ;0) = /*(X obs ;0*,a) 

holds for some 9 £ 0, where /*(X obs ; 0*, a) = 
fx . g*(X ohs , X mis ; 9*, a) dX mis . The condition (3.1) 
defines a mapping = R(6*,a), called the reduc- 
tion function, from the expanded parameter space 
x A to the original parameter space 0. For con- 
venience, assume that the expanded parameters are 
represented in such a way that the original complete- 
data and observed-data models are recovered by fix- 
ing a at cko • Formally, assume that there exists a null 
value of a, denoted by ao, such that 9 = R(9,ao) 
for every 9 £ 0. When applied to the parameter- 
expanded complete-data model g^(X \ )S , X m i s ; 9*, a), 
the EM algorithm, called the PX-EM algorithm, cre- 
ates a sequence {(#* , a®)} in x A. In the original 
parameter space 0, PX-EM generates a sequence 
{0(*) = R(9i t) , a W)} and converges no slower than 
the corresponding EM based on 5(X obs , X m j s ; 9); see 
Liu, Rubin and Wu (1998). 

For simplicity and stability, Liu, Rubin and Wu 
(1998) use (9 {t \a ) instead of (#,q w ) for the E step. 
As a result, PX-EM shares with EM its E step and 
modifies its M step by mapping (#i <+1 ^ , c/ t+1 ) ) to the 
original space = R(9^ +1 \ a^ +1 ^). More pre- 

cisely, the PX-EM algorithm is defined by replacing 
the E and M steps of EM with the following: 

PX-E step. Compute 
Q(0*,a\e®,ao) 

= E(lng*(X obs ,X m i S ;0*,a)|X obs ,0* = 9^\ 

a = ao) 

as a function of (9*, a) £0 x A. 
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PX-M step. Find 

(#i m) , ) = arg max Q(0„ , a\9 {t) , a ) 

and update 

e^=R(ei t+l \a^). 

Since it is the ordinary EM applied to the param- 
eter expanded complete-data model, PX-EM shares 
with EM its simplicity and stability. Liu, Rubin and 
Wu (1998) established theoretical results to show 
that PX-EM can converge no slower than EM. Sec- 
tion 3.2 uses the robit regression example to give 
the statistical interpretation of Liu, Rubin and Wu 
(1998) in terms of covariance adjustment. With the 
toy example, Section 3.3 demonstrates that PX-EM 
can be dramatically faster than its parent EM. A 
discussion of why PX-EM can perform better than 
EM is given in Section 4. 

3.2 Efficient Analysis of Imputed Missing Data: 
Robit Regression 

The E step of EM imputes the sufficient statistics 
S TXX i and S TXZ with their expectations based on the 
predictive distribution of the missing (rj,^) data 
conditioned on the observed data X \^ s and j3® , the 
current estimate of (3 at the tth iteration. Had the 
ML estimate of /?, /3, been used to specify the pre- 
dictive distribution, EM would have converged on 
the following M step, which in this case performs 
correct ML inference. We call the predictive distri- 
bution using f3 the correct imputation model. Before 
convergence, that is, f3^ ^ j3, the E step imputes the 
sufficient statistics S TXX i and S TXZ using an incorrect 
imputation model. The M step also uses a wrong 
model since it does not take into account that the 
data were incorrectly imputed based on an assumed 
parameter value (3^ ^ (3. The M step moves the es- 
timate toward /3, but the difference between 
/3(* +1 ) and /3 can be regarded as bias due to the use 
of the (3 {t \ 

The bias induced by the E step can be reduced 
by making use of recognizable discrepancies between 
imputed statistics and their values under the correct 
imputation model. To capture such discrepancies, 
Liu, Rubin and Wu (1998) considered parameters 
that are statistically identified in the complete-data 
model but not in the observed-data model. These 
parameters are fixed at their default values to render 
the observed-data model identifiable. In the context 
of EM for robit regression, these parameters are the 



scale parameters of n and Zi, denoted by a and a. 
In the observed-data model, they take the default 
values ctQ = l and gq = 1 . 

When activated, the extra parameters are esti- 
mated by the M step and these estimates converge 
to the default values to produce ML parameter es- 
timates for the observed data model. Thus, in the 
robit regression model, we identify the default val- 
ues of the extra parameters as MLEs: ocq = a = 1 
and o"o = & = 1. Denote the corresponding EM es- 
timates by and a^ t+l \ The discrepancies be- 
tween (a( t+1 \a( t+1 ') and (a, a) reveal the existence 
of bias induced by the wrong imputation model. 
These discrepancies can be used to adjust the es- 
timate of the parameter of interest, f3, at each it- 
eration. This is exactly what PX-EM is formulated 
to do, and the resulting algorithm converges faster 
than the original EM. 

Formally, the extra parameter (a, a) introduced 
to capture the bias in the imputed values of and 
zi is called the expansion parameter. The complete- 
data model is thus both data-augmented as well as 
parameter-augmented. For correct inference at con- 
vergence, data augmentation is required to preserve 
the observed-data model after integrating out miss- 
ing data. Likewise, parameter expansion needs to 
satisfy the observed-data model preservation condi- 
tion (3.1). In the robit regression model, let (/?*, a, a) 
be the expanded parameter with /3* playing the same 
role as (3 in the original model. The preservation 
condition states that for every expanded parameter 
value (/?*, a, <t), there exists a value of (3 such that 
the sampling model of the y^s obtained from the 
parameter expanded model is the same as the orig- 
inal sampling model given (3. This condition defines 
a mapping /3 = i?(/3*, a, a), the reduction function. 
This reduction function is used in PX-EM to adjust 
the value of (3^ t+1 ^ produced by the M step. 

The detailed implementation of PX-EM for ro- 
bit regression is as follows. The parameter-expanded 
complete-data model is obtained by replacing (2.5) 
and (2.6) with 

(3.2) (Ti/a)\((3* , a, a) ~ Gamma(i//2, u/2) 
and 

(3.3) Zi\(n,^,a,a) ~ N(x-/3*, a 1 In) 

for i = 1, . . . ,n. Routine algebraic operation yields 
the reduction function 

(3 = R({3*,a, a) 

(3.4) 

= {a l ' 2 /a)^ G^ p ;a>0;a>0). 
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The expanded parameterization in (3.2) and (3.3) 
is a natural choice if the missing data are viewed as 
real and a parameterization is sought that provides a 
model that is flexible while preserving the observed 
data model and allowing the original parameteriza- 
tion to be recovered through the reduction function. 
For example, if r% is treated as fixed, the model for 
Zi is a regression model with fixed variance. Adding 
a 2 and a allows the variance of Z\ and the scale of 
Ti to be estimated freely in the expanded model. 

The sufficient statistics for the expanded parame- 
ter (/?*, a, cr) now become 



n 



Ti, 



(3.5) 



i=l 



S TZ 2 



E 

i=i 



TiZ { 



S TT. 7. 



i=l 
n 

= E 

i=i 



TiXiZ; . 



The complete-data ML estimate of /3* is the same as 
that of /3 in the original complete-data model. The 
complete-data ML estimates of a and a are given 
by 



(3.6) 



Or 



a,. 



— S T and 

(S TZ 2 — S TXZ S ,S T 



n 



The PX-EM algorithm is simply an EM applied to 
the parameter expanded complete-data model with 
an M step followed by (or modified to contain) a re- 
duction step. The reduction step uses the reduction 
function to map the estimate in the expanded pa- 
rameter space to the original parameter space. For 
the robit example, PX-EM is obtained by modifying 
the E and M steps as follows. 

PX-E step. This is the same as the E step of EM ex- 
cept for the evaluation of two additional expected 
sufficient statistics: 



S T = E(S T \p = pV,X oh8 ) 



i=i 



and 



S TZ 2 = E(S TZ 2\(3 = {3®, X ohs ) 
= n(u+l) 



n n 



-"E^E f ^ (t) ^-^ (t) )' 

i=l i=l 

where fj's and z%8 are available from the E step 
of EM. 



PX-M step. Compute /3* = S~ XX ,S TXZ , a 2 

(S TZ 2 - S TXZ S~^ X ,S TXZ ), and d* = n _1 5 T and then 
use the reduction to obtain /3(' +1 ) = (6t l J 2 /cr*)/3*. 

For a numerical example, consider the data of 
Finney (1947), which consist of 39 binary responses 
denoting the presence (y = 1) or absence (y = 0) of 
vaso-constriction of the skin of the subjects after in- 
spiration of a volume V of air at inspiration rate 
R. The data were obtained from repeated measure- 
ments on three individual subjects, the numbers of 
observations per subject being 9, 8 and 22. Finney 
(1947) found no evidence of inter-subject variabil- 
ity, treated the data as 39 independent observations, 
and analyzed the data using the probit regression 
model with V and R in the logarithm scale as covari- 
ates. This data set was also analyzed by Liu (2004) 
to illustrate robit regression. Due to three outlying 
observations, the MLE of the degrees of freedom v 
is very small, z> = 0.11. 

Here we use this data set with ln(U) and ln(R) as 
the covariates and take the fixed v = 2 as a numer- 
ical example to compare EM and PX-EM. Numer- 
ical results comparing the rates of convergence of 
EM and PX-EM are displayed in Figure 1. PX-EM 
shows a clear and dramatic convergence gain over 
EM. For convenience we choose to report the de- 
tailed results over iterations. The algorithms were 
coded in R, which makes CPU comparison unreli- 
able. Since extra computation for the PX-EM imple- 
mentation is minor, we believe the same conclusion 
holds in terms of CPU times. 

3.3 PX-EM with Fast Convergence: 
The Toy Example 

The model X\(Z,X) ~ Binomial(Z, tt) may not fit 
the imputed value of missing data Z very well in the 
sense that X/Z is quite different from tt. This mis- 
match can be used to adjust A (m) . To adjust A (m) , 
we activate tt and let a denote the activated pa- 
rameter with oq = tt. Now the parameter-expanded 
complete-data model becomes 



and 



Z|(A*,a) ~ Poisson(A*) 



X\(Z, A*, a) ~ Binomial^, a) 



where A* > and a G (0,1). If the missing data 
were treated as being observed, this model allows 
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(a) 



(d) 





(b) 



(e) 





(c) 



(f) 





Fig. 1. EM (solid) and PX-EM (dashed) sequences of the regression coefficients /?o (a), /3i (b), (3-2 (c), and log-likeli- 
hood in the robit regression with x — (l,ln(V),ln(i?)). The rates of convergence of EM (solid) and PX-EM (dashed) are 
shown in (e) by — — £^°°'\, where t-*' denotes the log-likelihood value at the tth iteration, and in (f) by 

1/3. 



(t+i) 



P^^W -$°°>\ for j = 0,1 and 2. 



3f 



? (oo) 



the mean parameters for both X and Z to be esti- 
mated. The two corresponding observed-data mod- 
els are Poisson(Avr) and Poisson(A*a), giving the re- 
duction function 

a 

(3.7) A = i2(A*,a) = -A*. 

7T 

The complete-data sufficient statistics are Z and 
X. The complete-data ML estimates of A* and a are 



given by 

X 

(3.8) K,com = Z and a com = —. 

The resulting PX-EM has the following E and M 
steps: 

PX-E step. This is the same as the E step of EM. 
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PX-M step. Replace Z in (3.8) with Z to obtain 
a!' +1) = Z and = X/Z. Update A using the 

reduction function and obtain 

7rZ VT 

The PX-EM algorithm in this case converges in one 
step. Although artificial, this toy example shows again 
that PX-EM can converge dramatically faster than 
its parent EM. 

4. UNFOLDING THE MYSTERY OF PX-EM 

The statistical interpretation in terms of covari- 
ance adjustment, explained by the robit example 
above and the Student-t example in Liu, Rubin and 
Wu (1998), and the theoretical results of Liu, Ru- 
bin and Wu (1998) help reveal the PX-EM magic. 
To further unfold the mystery of PX-EM, we discuss 
the nonidentifiability of expanded parameters in the 
observed-data model in Section 4.1 and take a look 
at PX-EM from the point of view of efficient data 
augmentation in Section 4.2. 

4.1 Nonidentifiability of Expanded Parameters 
and Applicability of PX-EM 

It is often the case in PX-EM that, even though 
the expanded parameter (8*, a) is identifiable from 
Q(6^,a\6^ ,ao) (the expected parameter-expanded 
complete-data log-likelihood), it is not identifiable 
from the corresponding observed-data loglikelihood 

L* (6>* , a) = In /* ( A" obs ; 0* , a) . 

It is helpful to consider £*(#*, a) for understand- 
ing PX-EM, as the PX-M step directly increases 
a) through maximizing Q(6*,a\0®,ato). Nat- 
urally, from a mathematical point of view, unfolding 
the actual likelihood in the larger or expanded pa- 
rameter space Q x A shows how PX-EM steps can 
lead to increases in the likelihood function faster 
than can moves in the original space 0. 

4.1.1 The observed-data log-likelihood surface over 
x A The observed-data log-likelihood, as a func- 
tion of (6**, a), is determined by the actual log-likeli- 
hood L{9) = ln/(-Xob s ;0) with 6 replaced by 6 = 
R(9*,a) so that 

(4.1) L*(9*,a) = L(R{9*,a)) ((0*, a) 6 X A). 

Thus, each point 9 € corresponds to a subspace 
{(6**, a) G x A,R(0*,a) =9}, over which L*(6>*,a) 
is constant. 




InO) 



Fig. 2. Perspective plots of the parameter-expanded ob- 
served-data log-likelihood L(\*,a) (top) and the parameter- 
expanded complete-data log-likelihood Q(X»,a\X^') (bottom) 
in the toy example with X = 8, ty = 0.25, and A'*' = 8. 

For example, when 9 and a are one-dimensional 
parameters, L{9) can be represented by a curve in 
the two-dimensional space x L(Q), whereas L*(9*, 
a) is a family of curves indexed by a. The family 
of curves L*(8*,a) form a surface in the style of a 
mountain range in the three-dimensional space x 
A x L(0). For the toy example, this is depicted in 
Figure 2 by the top panel 3-D perspective plot and 
in Figure 3 by the image with dashed contours or 
"elevation" lines. The mode of L{9) now becomes 
a set of modes of the same "altitude," one for each 
fixed a. That is, the mode of L(6) is expanded into 
the "ridge" shown, for example, by the thick line in 
Figure 3. 

4.1.2 Likelihood maximization in PX-EM The E 
step in PX-EM implicitly computes a family of ex- 
pected complete-data log-likelihood functions, which 
are the Q-functions used in (1.1), over the origi- 
nal parameter space indexed by the expansion pa- 
rameter a. This is because PX-EM introduces no 
additional or different missing data in the larger 
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Fig. 3. PX-EM for the toy example with X = 8, tt = 0.25, 
and A^*' = 8. 77ie parameter- expanded observed-data log-like- 
lihood function Z/(A*,a) is shown by shading and dashed con- 
tours with a maximum along the ridge indicated by a solid 
thick line. The expected parameter- expanded complete-data 
log-likelihood Q(A», a|A^') is shown by the ellipse-like solid 
contours. In this example, maximization of Q(A«, a|A'*') over 
(A,, a) can be obtained in two conditional maximization steps, 
labeled as the two ECM updates. The PX-M step moves to 
a point on the ridge of Z/(A»,a), and the subsequent reduc- 
tion-step moves this point along the the ridge of L(\*,a) to 
the point with a = n . 

complete-data model. In other words, the parent E 
step effectively computes a surface over Q x A that 
can be used as a Q-function to approximate the ex- 
panded loglikelihood L*(6*,a) defined in (4.1). This 
Q-function for the toy example is shown in Figure 
2 by the bottom panel 3-D perspective plot and in 
Figure 3 by the nearly-elliptical contours. For this 
one-step convergence PX-EM example, the mode of 
this Q-function is on the ridge of the expanded log- 
likelihood L*(9*,oi). We note that this is typically 
not the case in more realistic examples. In the gen- 
eral case, the mode of the Q-function would typically 
be located on one elevation line that is higher than 
the elevation line where the update (6^\ao) found 
by EM is located. 

Somewhat surprisingly, any such Q-function for 
each fixed a is EM-valid. By EM-valid, we mean 
that increasing the Q-function results in an increase 
of the actual likelihood in the expanded space and 
thereby in the original space after the reduction step. 
This is due to two facts: (i) the joint Q-function is 
EM-valid for L*(8*,a) and, thus, for L(9) as well, 



and (ii) an M step with any fixed a, which finds 
ei t+1) = argmaxQ(0*,a|0^,a o ), 

followed by the reduction = R(6* +1 \ a) is sim- 

ply a conditional maximization step. Additionally, 
in the context of the ECM algorithm of Meng and 
Rubin (1993), the parent EM is an incomplete ECM 
with only one single CM step over x A. This re- 
lationship is explored in greater detail in the next 
section. 

4.1.3 PX-EM vs. (efficient) ECM over S x A In 
theory, PX-EM has a single M step over the entire 
space G x A. Note that 

max Q(#*, a\6^\ ao) = maxmaxQ(#*, a\6^ , ao)- 

(0,,a) a 6», 

When 

# m) =argmaxQ(0*,a|0(*\a o ) 

does not depends on a, as is often the case in many 
PX-EM examples, the PX-M step is equivalent to 
a cycle of two CM steps: one is the M step of EM, 
and the other updates a with 6* fixed at . This 
version of ECM for the toy example is illustrated in 
Figure 3. In this case, ECM is efficient for it gener- 
ates the PX-EM update. 

To summarize, denote by ECM^ a6t j the above ver- 
sion of ECM over x A. Typically, the algorithms 
can then be ordered in terms of performance as 

(4.2) EM r< ECM {aA} r< PX-EM 

over 6 x A. It should be noted that by typically, we 
mean the conclusion is reached in an analogy with 
comparing the EM algorithm and the Generalized 
EM algorithm (GEM) (Dempster, Laird and Rubin, 
1977), that is, EM typically converges faster than 
GEM, but counter examples exist; see, for exam- 
ple, Section 5.4 of van Dyk and Meng (2010) and 
the alternative explanation from an ECME point 
of view in Section 4.3 of Liu and Rubin (1998) on 
why ECM can be faster than EM. To elaborate it 
further with our robit example, it may be also in- 
teresting to note that when the reduction function 
(3.4) is modified by replacing the adjustment fac- 
tor (a 1 / 2 /a) with (a/a), a typo made in the ear- 
lier versions of the PX-EM for the robit regression 
model, the resulting (wrong) PX-EM converges ac- 
tually faster than the (correct) PX-EM for the nu- 
merical example in Section 3.2. In general, more ef- 
ficiency can be gained by replacing the CM step of 
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ECM over a with a CM step maximizing the corres- 
ponding actual constrained likelihood in the param- 
eter expanded space. This is effectively a parameter- 
expanded ECME algorithm; see such an example 
for the Student-t distribution given in Liu (1997). 
More discussion on ECME and other state-of-the- 
art methods for accelerating the EM algorithm can 
be found in He and Liu (2009). Their discussion 
on the method termed SOR provides a relevant ex- 
planation why the above wrong PX-EM and other 
wrong PX-EM versions, such as the one using the 
wrong reduction function (3 = (a/a 2 )f3* in the nu- 
merical robit example, can converge faster than the 
correct PX-EM. 

Perhaps most importantly, the above discussion 
further explains why PX-EM can perform better 
than EM can, and unfolds the mystery of PX-EM, 
in addition to the covariance adjustment interpreta- 
tion. 

4.2 Efficient Data Augmentation via Parameter 
Expansion 

Meng and van Dyk (1997) consider efficient data 
augmentation for creating fast converging algorithms. 
They search for efficient augmenting schemes by work- 
ing with the fraction of missing-data information. 
Here we show that PX-EM can also be viewed as 
an alternative way of doing efficient data augmenta- 
tion. Unlike Meng and van Dyk (1997), who find a 
fixed augmenting scheme that works for all EM iter- 
ations, the following procedure is a way to choose an 
adaptive augmenting scheme for each EM iteration. 
Rather than control the fraction of missing-data in- 
formation, this procedure reduces bias through the 
expansion parameter. For the sake of clarity, we use 
the artificial example of Section 2.1 to make our ar- 
gument. 

Consider the parameter-expanded complete-data 
likelihood obtained from (2.1) by activating ao = tt, 
that is, 



plays the role of an ancillary statistic; see Ghosh, 
Reid and Fraser (2010) for an introduction to ancil- 
lary statistics. With the correct imputation model, 
or at convergence, the imputed value Z satisfies 



(4.3) 



Z\ 



Z\ X\{Z-X)\ 



a 



,z-x 



(A*>0;0<a<l) 



which has the canonical representation 

h(X, Z) C {K, a ) e Zln[A.(l-a)]+XW(l-a) 

(A*>0;0<a<l). 

Thus, when fixed at the given value, tt, for identi- 
fiability, the complete-data ML estimate a = X/Z 



TT 



X 
Z 



or Z 



X 



TT 



Thus, we can consider modifying the E step of EM to 
produce an imputed statistic Z that satisfies (4.3). 

In the context of PX-EM, the current estimate A^ 
corresponds to the following subset of the expanded 
parameter space: 



(4.4) 



nf> = {(A*, a) : i?(A*, a) = R{X^,a )} 
= {(A*,a):A W 7r = aA*}. 



Thus, we can use the imputation model defined by 
the parameter-expanded complete-data model con- 
ditioned on an arbitrary point (A*, a) G . For ef- 
ficient data augmentation, we choose a particular 
point (A*, a) G fi£ , if it exists, so that (4.3) holds. 
Since 

Z = E(Z\X, A*,a) = X + A*(l-a), 
to obtain the desired imputation model, we solve 

X + A*(l-a) = — , 

TT 

A (t) vr = QA, 

for (A*, a). This system of equations has the solution 



A* — X- 



TT 



+ 



A (t) 



TT 



TT 



and 



a 



A"((l-7r)/7r) + A(*)7r' 



The E step of the EM algorithm based on the corre- 
sponding imputation model produces Z = X/ir. The 
following M step of EM gives A( t+1 ) =Z = X/tt. 

The resulting EM algorithm is effectively the PX- 
EM algorithm. This implies that PX-EM can be 
understood from the perspective of efficient data 
augmentation via parameter expansion. Similar ar- 
guments can be made for other PX-EM examples 
having imputed ancillary statistics. In the general 
case, such an efficient data augmentation amounts to 
modifying imputed complete-data sufficient statis- 
tics and can be viewed as re-imputation of missing 
sufficient statistics. 
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5. DISCUSSION 

Gelman (2004) notes that "progress in statisti- 
cal computation often leads to advances in statisti- 
cal modeling," which opens our eyes to the broader 
picture. Statistical interpretations of EM and PX- 
EM reveal that statistical thinking can aid in un- 
derstanding and developing iterative algorithms. It 
seems natural to apply fundamental concepts from 
statistical inference to address statistical problems 
such as ML estimation and Bayesian estimation (see, 
e.g., Liu and Wu, 1999; van Dyk and Meng, 2001; Qi 
and Jaakkola, 2007; Hobert and Marchev, 2008). A 
recent example is the work of Yu and Meng (2008, 
2010), which uses relationships motivated by the 
concepts of ancillarity and sufficiency in order to 
find optimal parameterizations for data augmenta- 
tion algorithms used in Bayesian inference. However, 
statistical thinking can also be helpful for general- 
purpose optimization algorithms such as in the im- 
provements to the quasi-Newton algorithm devel- 
oped by Liu and Vander Wiel (2007). 

Thinking outside the box, here we briefly discuss 
other potential applications of parameter expansion 
to statistical inference. The fundamental idea of PX- 
EM — the use of expanded parameters to capture in- 
formation in data — leads immediately to a possible 
application of parameter expansion for "dimension- 
matching" in Fisher's conditional inference and fidu- 
cial inference (see, e.g., Fisher, 1973), where difficul- 
ties arise when the dimensionality of the minimal 
sufficient statistics is larger than the number of free 
parameters to be inferred. It is well known that, 
while attempting to build a solid foundation for sta- 
tistical inference, the ideas behind Fisher's fiducial 
inference have not been well developed. Neverthe- 
less, it is expected that parameter expansion can 
be useful in developing new ideas for statistical in- 
ference. For example, a Dempster-Shafer or fiducial- 
like method, called the inferential model (IM) frame- 
work, has been proposed by Zhang and Liu (2011) 
and Martin, Zhang and Liu (2010). Of particular 
interest is the parameter expansion technique pro- 
posed by Martin, Hwang and Liu (2010) for what 
they call weak marginal inference using IMs. Us- 
ing this parameter expansion technique, they pro- 
vide satisfactory resolutions to the famous Stein's 
paradox and the Behrens-Fisher problem. 

Although brief, the above discussion shows that 
parameter expansion has the potential to contribute 
to a variety of applications in computation and sta- 
tistical inference. To conclude this review article, we 



speculate on one possible application of parameter 
expansion to the method of maximum likelihood for 
which the EM algorithm has proven to be a use- 
ful computational tool. The prospect of applying 
general statistical ideas to computational problems 
has also led us to thinking about model checking 
or goodness of fit to solve the unbounded likelihood 
problem in fitting Student-t and mixture models, 
for which EM is often the first choice. In the case 
with unbounded likelihood functions, for example, a 
high-likelihood model may not fit the observed data 
well and then inferential results can be nonsensical. 
It would be interesting to see if the general idea of 
parameter expansion for efficient inference can be 
extended for "valid inference" as well. However, it 
is not our intention here to discuss these open prob- 
lems in depth. Based on the past success in this area, 
it can be expected that parameter expansion meth- 
ods will continue to aid computation and inference. 
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